If you like my notebook, please upvote my work!

If you use parts of this notebook in your scripts or notebooks, some kind of credit, for instance a link back to this notebook, would be very much appreciated. Thanks in advance! :)

Thank you! :) Hope you like my work!

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session
/kaggle/input/digit-recognizer/train.csv
/kaggle/input/digit-recognizer/test.csv
/kaggle/input/digit-recognizer/sample_submission.csv

Importing the required libraries.

In [2]:
import seaborn as sb
import plotly.express as px
import sklearn.neighbors as KNN
import plotly.graph_objects as go
import plotly.figure_factory as ff
from matplotlib import pyplot as plt
from sklearn.decomposition import PCA
from sklearn.metrics import make_scorer
from sklearn.model_selection import train_test_split

Loading the Dataset.

In [3]:
df_train = pd.read_csv('/kaggle/input/digit-recognizer/train.csv')
df_train.describe()
Out[3]:
label pixel0 pixel1 pixel2 pixel3 pixel4 pixel5 pixel6 pixel7 pixel8 ... pixel774 pixel775 pixel776 pixel777 pixel778 pixel779 pixel780 pixel781 pixel782 pixel783
count 42000.000000 42000.0 42000.0 42000.0 42000.0 42000.0 42000.0 42000.0 42000.0 42000.0 ... 42000.000000 42000.000000 42000.000000 42000.00000 42000.000000 42000.000000 42000.0 42000.0 42000.0 42000.0
mean 4.456643 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.219286 0.117095 0.059024 0.02019 0.017238 0.002857 0.0 0.0 0.0 0.0
std 2.887730 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 6.312890 4.633819 3.274488 1.75987 1.894498 0.414264 0.0 0.0 0.0 0.0
min 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000 0.0 0.0 0.0 0.0
25% 2.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000 0.0 0.0 0.0 0.0
50% 4.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000 0.0 0.0 0.0 0.0
75% 7.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000 0.0 0.0 0.0 0.0
max 9.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 254.000000 254.000000 253.000000 253.00000 254.000000 62.000000 0.0 0.0 0.0 0.0

8 rows × 785 columns

Splitting the target and predictor variables.

In [4]:
df_train_x = df_train.drop('label', axis=1)
df_train_y = df_train[['label']]

Visualising percentage variance loss.

Fetching the variance ratios for PCA over the given dataset.

In [5]:
pca = PCA().fit(df_train_x)
pca.explained_variance_ratio_
Out[5]:
array([9.74893769e-02, 7.16026628e-02, 6.14590336e-02, 5.37930200e-02,
       ...,
       2.22099514e-34, 1.83572922e-34, 1.51304027e-34, 2.46588810e-35])
In [6]:
a = []
s = 0
a.append([0, (1-s)*100, 'Percentage variance lost is: '+str((1-s)*100)+'%'])
for i in range(len(pca.explained_variance_ratio_)):
    s += pca.explained_variance_ratio_[i]
    a.append([i+1, (1-s)*100, 'Percentage variance lost is: '+
              str((((1-s)*100)//0.0001)/10000)+'%'])
arr = pd.DataFrame(a)
arr = arr.rename(columns={0: 'No of components used:',
                          1: 'Total variance lost (in percentage)'})
px.line(data_frame=arr, x='No of components used:',
        y='Total variance lost (in percentage)',
        range_x=[0, 784], range_y=[0, 100], hover_name=2,
        title='Graph depicting the loss in variance as we reduce the number of components.')

This graph depicts how the loss in variance decreases as we increase the number of components.

  1. Using only 100 components, we can retain almost 92% of the variance in the data.
  2. As we increase the number of components, the variance retained increases rapidly at first and then more slowly afterwards.
  3. If we keep increasing the number of components, the variance loss eventually reaches 0 at 784 components.
  4. Using 300 components rather than 784 still retains 98.7% of the total variance, so I have used 300 components for the model.
  5. If you want a more intuitive feel for how PCA transforms the dataset with different numbers of components, check out my other notebook with animated charts for PCA!
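The component count for a target variance level can also be read off directly from the cumulative explained-variance ratios. A minimal sketch on synthetic stand-in data (the notebook itself would use `df_train_x`; the shapes and threshold here are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the 42000x784 training frame.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 50)) @ rng.normal(size=(50, 50))

pca = PCA().fit(X)
cum_var = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components retaining at least 95% of the variance:
# searchsorted finds the first index where the cumulative ratio reaches 0.95.
n_95 = int(np.searchsorted(cum_var, 0.95) + 1)
print(n_95, cum_var[n_95 - 1])
```

scikit-learn can also do this selection internally: passing a float to the constructor, e.g. `PCA(n_components=0.95)`, keeps the smallest number of components that explains at least that fraction of the variance.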

Visualising the effect of PCA over input images.

Creating the image matrix for the dataset.

In [7]:
components = 300
pca = PCA(n_components=components).fit(df_train_x)
numpy_train_x = df_train_x.to_numpy()
pca_trans = pca.transform(numpy_train_x)
pca_invtrans = pca.inverse_transform(pca_trans)
row = 10
column = 7

# Build a grid of image pairs: each original digit (border value 400, drawn red)
# next to its PCA reconstruction (border value 450, drawn blue).
for i in range(row):
    for j in range(column):
        a = numpy_train_x[j + (i * column)].reshape(28, 28)
        a = np.pad(a, pad_width=1, mode='constant', constant_values=400)
        b = pca_invtrans[j + (i * column)].reshape(28, 28)
        b = np.pad(b, pad_width=1, mode='constant', constant_values=450)
        if j == 0:
            stack = np.hstack((a, b))
        else:
            stack = np.hstack((stack, a, b))
    if i == 0:
        final = stack
    else:
        final = np.vstack((final, stack))
final = np.pad(final, pad_width=2, mode='constant', constant_values=500)
img = final

Creating matrix of labels for the plot.

In [8]:
a = df_train_y['label'][0:row*column].to_numpy()
label = []
for i in a:
    label.append("The Label for the digit is: " + str(i))
final = []
border = ['Border'] * 604
final.append(border)
final.append(border)
for i in range(row):
    final.append(border)
    a = ['Border', 'Border']
    for j in range(column):
        for k in range(2):  # one label column each for the original and its PCA transform
            a.append('Border')
            for l in range(28):
                a.append(label[i * column + j])
            a.append('Border')
    a.append('Border')
    a.append('Border')
    for r in range(28):  # repeat the row's labels for each of the 28 pixel rows
        final.append(a)
    final.append(border)
final.append(border)
final.append(border)
label = final

Plotting the image matrix.

In [9]:
fig = go.Figure(data = go.Heatmap(z = img,colorbar = None,
                                  colorscale = [[0,'black'],[0.7,'white'],
                                                [0.8,'red'],[0.9,'blue'],
                                                [1.0,'rgb(255,0,255)']],
                                  zmin = 0,zmax = 500,zauto = False,
                                  hovertext = label))
fig['layout']['yaxis']['autorange'] = "reversed"
fig.update_layout(title = 'The Distortion induced due to PCA while using '+
                  str(components)+' components.',
                  height  = 600,width = 1100,yaxis_tickvals = [0],
                  yaxis_ticktext =[' '],xaxis_tickvals = [0],
                  xaxis_ticktext =[' '],
                  xaxis_title = 'The Original images have a red border while a blue one has been used for their PCA transforms.')
fig.update_traces(showscale = False)
fig.show()

The reconstructed images are therefore very similar to the originals, with no significant distortion; in some cases the differences are hard to spot with the naked eye.
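The visual similarity can also be quantified as a mean squared reconstruction error between the originals and their inverse transforms. A sketch on random stand-in data (the notebook itself would compare `numpy_train_x` with `pca_invtrans`; the random pixels here are only an assumption for a self-contained example):

```python
import numpy as np
from sklearn.decomposition import PCA

# Random stand-in for the 28x28 digit images, flattened to 784 pixels.
rng = np.random.default_rng(1)
X = rng.integers(0, 255, size=(500, 784)).astype(float)

pca = PCA(n_components=300).fit(X)
X_rec = pca.inverse_transform(pca.transform(X))

# Mean squared reconstruction error per pixel; small when most variance is retained.
mse = np.mean((X - X_rec) ** 2)
print(mse)
```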

Machine Learning Model.

Splitting data into train and test set

In [10]:
x_train, x_test, y_train, y_test = train_test_split(
    pca.transform(numpy_train_x), df_train_y, test_size=0.1)

Testing accuracy of the model.

In [11]:
knn = KNN.KNeighborsClassifier(n_jobs=-1, n_neighbors=3, algorithm='ball_tree')
knn.fit(x_train, y_train.to_numpy().ravel())
pred = knn.predict(x_test)
pred
Out[11]:
array([1, 8, 0, ..., 7, 9, 4])
In [12]:
y_test_np = y_test.to_numpy().ravel()
score = 0
for i in range(len(y_test)):
    if pred[i] == y_test_np[i]:
        score = score + 1
score /= len(y_test)
print(str(score * 100))
96.69047619047619
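The manual accuracy loop above is equivalent to scikit-learn's `accuracy_score`. A small sketch with hypothetical predictions and labels (not the notebook's actual `pred` and `y_test`):

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical stand-ins for y_test and pred.
y_true = np.array([1, 8, 0, 7, 9, 4])
y_pred = np.array([1, 8, 0, 7, 4, 4])

acc = accuracy_score(y_true, y_pred)
print(acc)  # 5 of 6 correct -> 0.8333...
```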

Plotting the confusion matrix

In [13]:
predictions = pred
y_test_np = y_test.to_numpy().ravel()
classes = [0,1,2,3,4,5,6,7,8,9]


confusion_mat = np.zeros((len(classes), len(classes)))
for i in range(len(predictions)):
    confusion_mat[classes.index(predictions[i])][classes.index(y_test_np[i])] += 1
confusion_mat = confusion_mat.T  # rows: expected values, columns: predicted values
confusion_mat_norm = confusion_mat / len(y_test_np)
confusion_mat_norm = (confusion_mat_norm // 0.0001) / 10000  # truncate to 4 decimal places

fig = ff.create_annotated_heatmap(confusion_mat_norm, x=classes, y=classes, 
                                  annotation_text=confusion_mat_norm,
                                  colorscale='Viridis',text = confusion_mat,
                                  hovertemplate='Expected Value: %{y}<br>'+
                                                'Predicted Value: %{x}<br>'+
                                                'No. of datapoints in this category are: %{text}<extra></extra>')
fig.update_layout(title_text='<b>Confusion Matrix for the dataset:</b>',
                  xaxis = {'title':'Predicted Values'},width = 900,
                  yaxis = {'title':'Expected Values','autorange':'reversed'})
fig.update_traces(showscale = True)
fig.show()

We can see that the confusion matrix shows higher values for pairs of digits that look similar and are easy to confuse, and lower values for digits that are easy to distinguish.
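The hand-rolled confusion matrix (after the transpose) matches what `sklearn.metrics.confusion_matrix` produces: rows are expected values, columns are predictions. A sketch with hypothetical labels in place of `y_test_np` and `predictions`:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical stand-ins for the notebook's true labels and predictions.
y_true = [0, 1, 2, 2, 1, 0, 2]
y_pred = [0, 2, 2, 2, 1, 0, 1]

# Rows index the true class, columns the predicted class.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])
print(cm)
```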

Retraining the model over the whole dataset.

In [14]:
knn.fit(pca.transform(numpy_train_x), df_train_y.to_numpy().ravel())
Out[14]:
KNeighborsClassifier(algorithm='ball_tree', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=-1, n_neighbors=3, p=2,
                     weights='uniform')

Predicting output over the test set.

Reading test file

In [15]:
df_test = pd.read_csv('/kaggle/input/digit-recognizer/test.csv')
df_test.describe()
Out[15]:
pixel0 pixel1 pixel2 pixel3 pixel4 pixel5 pixel6 pixel7 pixel8 pixel9 ... pixel774 pixel775 pixel776 pixel777 pixel778 pixel779 pixel780 pixel781 pixel782 pixel783
count 28000.0 28000.0 28000.0 28000.0 28000.0 28000.0 28000.0 28000.0 28000.0 28000.0 ... 28000.000000 28000.000000 28000.000000 28000.000000 28000.000000 28000.0 28000.0 28000.0 28000.0 28000.0
mean 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.164607 0.073214 0.028036 0.011250 0.006536 0.0 0.0 0.0 0.0 0.0
std 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 5.473293 3.616811 1.813602 1.205211 0.807475 0.0 0.0 0.0 0.0 0.0
min 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0
25% 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0
50% 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0
75% 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0
max 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 253.000000 254.000000 193.000000 187.000000 119.000000 0.0 0.0 0.0 0.0 0.0

8 rows × 784 columns

Processing of test set

Applying PCA transform

In [16]:
np_test = pca.transform(df_test.to_numpy())

Predicting over test set

In [17]:
df_test['label'] = knn.predict(np_test)
In [18]:
df_test['ImageId'] = np.arange(1, 28001)
df_test.describe()
Out[18]:
pixel0 pixel1 pixel2 pixel3 pixel4 pixel5 pixel6 pixel7 pixel8 pixel9 ... pixel776 pixel777 pixel778 pixel779 pixel780 pixel781 pixel782 pixel783 label ImageId
count 28000.0 28000.0 28000.0 28000.0 28000.0 28000.0 28000.0 28000.0 28000.0 28000.0 ... 28000.000000 28000.000000 28000.000000 28000.0 28000.0 28000.0 28000.0 28000.0 28000.000000 28000.000000
mean 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.028036 0.011250 0.006536 0.0 0.0 0.0 0.0 0.0 4.414321 14000.500000
std 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 1.813602 1.205211 0.807475 0.0 0.0 0.0 0.0 0.0 2.896728 8083.048105
min 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.000000 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.000000 1.000000
25% 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.000000 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 2.000000 7000.750000
50% 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.000000 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 4.000000 14000.500000
75% 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.000000 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 7.000000 21000.250000
max 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 193.000000 187.000000 119.000000 0.0 0.0 0.0 0.0 0.0 9.000000 28000.000000

8 rows × 786 columns

Exporting output to CSV

In [19]:
df_test[['ImageId','label']].to_csv('submission.csv',index=False)
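A quick sanity check on the submission format can catch column-name or value-range mistakes before uploading. A sketch using an in-memory buffer and a tiny hypothetical frame in place of the real 28000-row output:

```python
import io
import pandas as pd

# Tiny stand-in for df_test[['ImageId','label']].
sub = pd.DataFrame({'ImageId': range(1, 6), 'label': [3, 0, 9, 2, 7]})

# Round-trip through CSV text, as a submission file would.
buf = io.StringIO()
sub.to_csv(buf, index=False)
buf.seek(0)
check = pd.read_csv(buf)

# The competition expects exactly these columns and digit labels 0-9.
print(list(check.columns), check['label'].between(0, 9).all())
```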